{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)\n",
    "\n",
    "The DBSCAN algorithm is a clustering algorithm that works really well for datasets that have regions of high density.\n",
    "\n",
    "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well  as cuDF DataFrames.\n",
    "\n",
    "For information about cuDF, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).\n",
    "\n",
    "For information about cuML's DBSCAN implementation: https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.DBSCAN."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from cuml.datasets import make_blobs\n",
    "from cuml.cluster import DBSCAN as cuDBSCAN\n",
    "from sklearn.cluster import DBSCAN as skDBSCAN\n",
    "from sklearn.metrics import adjusted_rand_score\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 10**4\n",
    "n_features = 2\n",
    "\n",
    "eps = 0.15\n",
    "min_samples = 3\n",
    "random_state = 23"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "device_data, device_labels = make_blobs(n_samples=n_samples, \n",
    "                                        n_features=n_features,\n",
    "                                        centers=5,\n",
    "                                        cluster_std=0.1,\n",
    "                                        random_state=random_state)\n",
    "\n",
    "device_data = cudf.DataFrame.from_gpu_matrix(device_data)\n",
    "device_labels = cudf.Series(device_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy dataset from GPU memory to host memory.\n",
    "# This is done to later compare CPU and GPU results.\n",
    "host_data = device_data.to_pandas()\n",
    "host_labels = device_labels.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "clustering_sk = skDBSCAN(eps=eps,\n",
    "                         min_samples=min_samples,\n",
    "                         algorithm=\"brute\",\n",
    "                         n_jobs=-1)\n",
    "\n",
    "clustering_sk.fit(host_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "clustering_cuml = cuDBSCAN(eps=eps,\n",
    "                           min_samples=min_samples,\n",
    "                           verbose=True,\n",
    "                           max_mbytes_per_batch=13e3)\n",
    "\n",
    "clustering_cuml.fit(device_data, out_dtype=\"int32\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize Centroids\n",
    "\n",
    "Chart the resulting clusters from cuML's DBSCAN, where each color represents one cluster found by the algorithm and black points are those not assigned to any cluster. (Unlike many clustering algorithms, DBSCAN can label some outlier points as \"noise\" that do not belong to a cluster.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(16, 10))\n",
    "\n",
    "X = np.array(host_data)\n",
    "labels = clustering_cuml.labels_\n",
    "\n",
    "n_clusters_ = len(labels)\n",
    "\n",
    "# Black removed and is used for noise instead.\n",
    "unique_labels = labels.unique()\n",
    "colors = [plt.cm.Spectral(each)\n",
    "          for each in np.linspace(0, 1, len(unique_labels))]\n",
    "for k, col in zip(unique_labels, colors):\n",
    "    if k == -1:\n",
    "        # Black used for noise.\n",
    "        col = [0, 0, 0, 1]\n",
    "\n",
    "    class_member_mask = (labels == k)\n",
    "\n",
    "    xy = X[class_member_mask]\n",
    "    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),\n",
    "             markersize=5, markeredgecolor=tuple(col))\n",
    "\n",
    "plt.title('Estimated number of clusters: %d' % n_clusters_)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Results\n",
    "\n",
    "Use the `adjusted_rand_score` to compare the two results, making sure the clusters are labeled similarly by both algorithms even if the exact numerical labels are not identical. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "sk_score = adjusted_rand_score(host_labels, clustering_sk.labels_)\n",
    "cuml_score = adjusted_rand_score(host_labels, clustering_cuml.labels_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = (cuml_score - sk_score) < 1e-10\n",
    "print('compare dbscan: cuml vs sklearn labels_ are ' + ('equal' if passed else 'NOT equal'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
